{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Testing if a Distribution is Normal"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import scipy.stats as stats\n",
    "import matplotlib.pyplot as plt\n",
    "import quiz_tests\n",
    "\n",
    "# Set plotting options\n",
    "%matplotlib inline\n",
    "plt.rc('figure', figsize=(16, 9))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create normal and non-normal distributions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sample A: Normal distribution\n",
    "sample_a = stats.norm.rvs(loc=0.0, scale=1.0, size=(1000,))\n",
    "\n",
    "# Sample B: Non-normal distribution\n",
    "sample_b = stats.lognorm.rvs(s=0.5, loc=0.0, scale=1.0, size=(1000,))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Boxplot-Whisker Plot and Histogram\n",
    "\n",
    "We can visually check if a distribution looks normally distributed.  Recall that a box whisker plot lets us check for symmetry around the mean.  A histogram lets us see the overall shape.  A QQ-plot lets us compare our data distribution with a normal distribution (or any other theoretical \"ideal\" distribution)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sample A: Normal distribution\n",
    "sample_a = stats.norm.rvs(loc=0.0, scale=1.0, size=(1000,))\n",
    "fig, axes = plt.subplots(2, 1, figsize=(16, 9), sharex=True)\n",
    "axes[0].boxplot(sample_a, vert=False)\n",
    "axes[1].hist(sample_a, bins=50)\n",
    "axes[0].set_title(\"Boxplot of a Normal Distribution\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sample B: Non-normal distribution\n",
    "sample_b = stats.lognorm.rvs(s=0.5, loc=0.0, scale=1.0, size=(1000,))\n",
    "fig, axes = plt.subplots(2, 1, figsize=(16, 9), sharex=True)\n",
    "axes[0].boxplot(sample_b, vert=False)\n",
    "axes[1].hist(sample_b, bins=50)\n",
    "axes[0].set_title(\"Boxplot of a Lognormal Distribution\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q-Q plot of normally-distributed sample\n",
    "plt.figure(figsize=(10, 10)); plt.axis('equal')\n",
    "stats.probplot(sample_a, dist='norm', plot=plt);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q-Q plot of non-normally-distributed sample\n",
    "plt.figure(figsize=(10, 10)); plt.axis('equal')\n",
    "stats.probplot(sample_b, dist='norm', plot=plt);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Testing for Normality\n",
    "### Shapiro-Wilk\n",
    "\n",
    "The Shapiro-Wilk test is available in the scipy library.  The null hypothesis assumes that the data distribution is normal.  If the p-value is greater than the chosen p-value, we'll assume that it's normal. Otherwise we assume that it's not normal.\n",
    "https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.shapiro.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "def is_normal(sample, test=stats.shapiro, p_level=0.05, **kwargs):\n",
    "    \"\"\"Apply a normality test to check if sample is normally distributed.\"\"\"\n",
    "    t_stat, p_value = test(sample, **kwargs)\n",
    "    print(\"Test statistic: {}, p-value: {}\".format(t_stat, p_value))\n",
    "    print(\"Is the distribution Likely Normal? {}\".format(p_value > p_level))\n",
    "    return p_value > p_level\n",
    "\n",
    "# Using Shapiro-Wilk test (default)\n",
    "print(\"Sample A:-\"); is_normal(sample_a);\n",
    "print(\"Sample B:-\"); is_normal(sample_b);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Kolmogorov-Smirnov\n",
    "\n",
    "The Kolmogorov-Smirnov is available in the scipy.stats library.  The K-S test compares the data distribution with a theoretical distribution.  We'll choose the 'norm' (normal) distribution as the theoretical distribution, and we also need to specify the mean and standard deviation of this theoretical distribution.  We'll set the mean and stanadard deviation of the theoretical norm with the mean and standard deviation of the data distribution.\n",
    "\n",
    "https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.kstest.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Quiz\n",
    "\n",
    "To use the Kolmogorov-Smirnov test, complete the function `is_normal_ks`.\n",
    "\n",
    "To set the variable normal_args, create a tuple with two values.  An example of a tuple is `(\"apple\",\"banana\")`\n",
    "The first is the mean of the sample. The second is the standard deviation of the sample.\n",
    "\n",
    "**hint:** Hint: Numpy has functions np.mean() and np.std()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def is_normal_ks(sample, test=stats.kstest, p_level=0.05, **kwargs):\n",
    "    \"\"\"\n",
    "    sample: a sample distribution\n",
    "    test: a function that tests for normality\n",
    "    p_level: if the test returns a p-value > than p_level, assume normality\n",
    "    \n",
    "    return: True if distribution is normal, False otherwise\n",
    "    \"\"\"\n",
    "    normal_args = \n",
    "    \n",
    "    t_stat, p_value = test(sample, 'norm', normal_args, **kwargs)\n",
    "    print(\"Test statistic: {}, p-value: {}\".format(t_stat, p_value))\n",
    "    print(\"Is the distribution Likely Normal? {}\".format(p_value > p_level))\n",
    "    return p_value > p_level\n",
    "\n",
    "quiz_tests.test_is_normal_ks(is_normal_ks)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using Kolmogorov-Smirnov test\n",
    "print(\"Sample A:-\"); is_normal_ks(sample_a);\n",
    "print(\"Sample B:-\"); is_normal_ks(sample_b);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you're stuck, you can also check out the solution [here](test_normality_solution.ipynb)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
